library(tidyverse)
# library(broom)
library(patchwork)
library(scales)
library(dagitty)
library(ggdag)
library(latex2exp) # Easily convert LaTeX into arcane plotmath expressions
library(ggtext) # Use markdown in ggplot labels
# Create a cleaner serifed theme to use throughout
theme_do_calc <- function() {
theme_dag(base_family = "Times New Roman") +
theme(plot.title = element_text(size = rel(1.5)),
plot.subtitle = element_markdown())
}
# Make all geom_dag_text() layers use these settings automatically
update_geom_defaults(ggdag:::GeomDagText, list(family = "Times New Roman",
fontface = "bold",
color = "black"))
In all the examples that follow, we are trying to estimate a causal
relationship between x and y. I think this is
\(P(Y| \mbox{do}(X=x))\) NEED
TO CHECK THIS
This represents a sequential path from intervention to outcome via a
confounder (x → z → y). Can bias
causal link between x and y
chain_dag <- dagify(z ~ x,
y ~ z,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(chain_dag) +
theme_dag()
Independence: x and y are independent
(conditioned on z)
impliedConditionalIndependencies(chain_dag)
## x _||_ y | z
Adjustment set: shows what adjustment is necessary to obtain the causal link between input and outcome
ggdag_adjustment_set(chain_dag) +
theme_dag()
ggdag_dseparated(chain_dag) +
theme_dag()
ggdag_dseparated(chain_dag, controlling_for = "z") +
theme_dag()
Simulated data - Equations
x: \(N(5, 1)\)x ~ z: \(\beta_1 =
4\)y ~ z: \(\beta_2 =
2\)n <- 1000
x <- rnorm(n, 5, 1)
z <- x * 4 + rnorm(n, 0, 2)
y <- z * 2 + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x. Ignoring the confounder results in
spurious relationship between intervention and outcome
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2218 -3.0478 0.0459 2.9456 12.5710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3674 0.7140 -1.915 0.0558 .
## x 8.2846 0.1408 58.843 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.435 on 998 degrees of freedom
## Multiple R-squared: 0.7763, Adjusted R-squared: 0.776
## F-statistic: 3462 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. This removes the backdoor path
between x and y and removes the apparent
relationship. Note that the coefficient between z and
y is correct.
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8294 -1.4598 -0.0431 1.4187 7.6018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.75773 0.33720 -2.247 0.0249 *
## x 0.32995 0.15031 2.195 0.0284 *
## z 1.95797 0.03318 59.005 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.093 on 997 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9501
## F-statistic: 9510 on 2 and 997 DF, p-value: < 2.2e-16
chain_dag <- dagify(z ~ x,
y ~ x + z,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(chain_dag) +
theme_dag()
Independence: x and y are independent
(conditioned on z)
impliedConditionalIndependencies(chain_dag)
Adjustment set: shows what adjustment is necessary to obtain the causal link between input and outcome
ggdag_adjustment_set(chain_dag) +
theme_dag()
Simulated data - Equations
x: \(N(5, 1)\)x ~ z: \(\beta_1 =
4\)y ~ x + z: \(\beta_2 = -1.5,
\beta_3 = 2\)n <- 1000
z <- rnorm(n, 10, 5)
x <- z * 5 + rnorm(n, 0, 2)
y <- x * -1.5 + z * 2 + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x. Ignoring the confounder results in
biased relationship between intervention and outcome
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6320 -1.4431 0.0286 1.4197 6.7193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006578 0.146668 0.045 0.964
## x -1.100033 0.002680 -410.433 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.149 on 998 degrees of freedom
## Multiple R-squared: 0.9941, Adjusted R-squared: 0.9941
## F-statistic: 1.685e+05 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. This removes the backdoor path
between x and y and results in correct
coefficient estimation.
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7522 -1.3014 0.0372 1.4069 6.1381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05084 0.13625 -0.373 0.709
## x -1.49889 0.03155 -47.505 <2e-16 ***
## z 2.00319 0.15797 12.681 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.995 on 997 degrees of freedom
## Multiple R-squared: 0.9949, Adjusted R-squared: 0.9949
## F-statistic: 9.779e+04 on 2 and 997 DF, p-value: < 2.2e-16
These occur when a confounding variable impacts both the intervention
(x) and the outcome (Y). Can result in a
spurious correlations (i.e. x and y will be
correlated without existing causality).
This has no causal link between x and y,
and a single confounder (z)
fork_dag <- dagify(x ~ z,
y ~ z,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(fork_dag) +
theme_dag()
Independence: x and y are independent
(conditioned on z)
impliedConditionalIndependencies(fork_dag)
## x _||_ y | z
Adjustment set: shows what adjustment is necessary to obtain the causal link between input and outcome
ggdag_adjustment_set(fork_dag) +
theme_dag()
Simulated data - Equations
z: \(N(10, 5)\)x ~ z: \(\beta_1 =
5\)y ~ z: \(\beta_2 =
2\)n <- 1000
z <- rnorm(n, 10, 5)
x <- z * 5 + rnorm(n, 0, 2)
y <- z * 2 + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x. Ignoring the confounder results in
spurious relationship between intervention and outcome
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6588 -1.4217 0.0332 1.4669 7.3531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.036410 0.152095 0.239 0.811
## x 0.399385 0.002756 144.920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.172 on 998 degrees of freedom
## Multiple R-squared: 0.9546, Adjusted R-squared: 0.9546
## F-statistic: 2.1e+04 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. This removes the backdoor path
between x and y and removes the apparent
relationship
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9982 -1.3117 0.0697 1.3173 5.9398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.11968 0.14047 -0.852 0.394
## x -0.02330 0.03151 -0.739 0.460
## z 2.12496 0.15790 13.458 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.999 on 997 degrees of freedom
## Multiple R-squared: 0.9616, Adjusted R-squared: 0.9615
## F-statistic: 1.249e+04 on 2 and 997 DF, p-value: < 2.2e-16
This is the same graph as before, but now we include a real
relationship between x and y
fork_dag <- dagify(x ~ z,
y ~ x + z,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(fork_dag) +
theme_dag()
Independence: Now x and y have no
conditional independence
impliedConditionalIndependencies(fork_dag)
Adjustment set:
ggdag_adjustment_set(fork_dag) +
theme_dag()
Equations
z: \(N(10, 5)\)x ~ z: \(\beta_1 =
5\)y ~ x + z: \(\beta_2 =
-4\), \(\beta_3 = 2\)n <- 1000
z <- rnorm(n, 10, 5)
x <- z * 5 + rnorm(n, 0, 5)
y <- x * -4 + z * 2 + rnorm(n, 0, 5)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x. Without controlling for the confounder,
the coefficient relating x and y is biased
(the relationship exists, but it’s biased)
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6047 -3.8278 0.1458 3.8001 17.0101
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.360705 0.352867 1.022 0.307
## x -3.610218 0.006354 -568.171 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.316 on 998 degrees of freedom
## Multiple R-squared: 0.9969, Adjusted R-squared: 0.9969
## F-statistic: 3.228e+05 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. Now we get the right value for
the coefficient (\(\sim4\))
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6118 -3.2955 -0.0864 3.5797 15.0722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3900 0.3343 -1.167 0.244
## x -3.9966 0.0318 -125.686 <2e-16 ***
## z 2.0040 0.1620 12.368 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.952 on 997 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.9973
## F-statistic: 1.861e+05 on 2 and 997 DF, p-value: < 2.2e-16
This is to illustrate Pearl’s point about uncessary confounders. This
graph has a backdoor path with two confounding variables
(z1 and z2). In order to test a causal link
between x and y, we only need to
control for one, as either will block the back door path
fork_dag <- dagify(x ~ z1,
z2 ~ z1,
y ~ z2,
coords = list(x = c(x = 1, y = 3, z1 = 1.5, z2 = 2.5),
y = c(x = 1, y = 1, z1 = 2, z2 = 1.5)),
exposure = "x",
outcome = "y")
ggdag(fork_dag) +
theme_dag()
Independence:
impliedConditionalIndependencies(fork_dag)
## x _||_ y | z2
## x _||_ y | z1
## x _||_ z2 | z1
## y _||_ z1 | z2
Adjustment set: Now we get two adjustment sets, as controlling for
either z1 or z2 removes the path
ggdag_adjustment_set(fork_dag) +
theme_dag()
Equations
z1: \(N(10, 5)\)x ~ z1: \(\beta_1 =
5\)y ~ z2: \(\beta_2 =
2\)z2 ~ z1: \(beta_3 =
-4\)n <- 1000
z1 <- rnorm(n, 10, 5)
x <- z1 * 5 + rnorm(n, 0, 5)
z2 <- z1 * -4 + rnorm(n, 0, 5)
y <- z2 * 2 + rnorm(n, 0, 5)
df <- data.frame(x, y, z1, z2)
pairs(df)
Model of y ~ x (shows spurious relationship)
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.614 -9.764 -0.455 9.385 51.687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.97585 0.97257 -3.06 0.00227 **
## x -1.53535 0.01734 -88.53 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.8 on 998 degrees of freedom
## Multiple R-squared: 0.8871, Adjusted R-squared: 0.8869
## F-statistic: 7838 on 1 and 998 DF, p-value: < 2.2e-16
Control for z1
fit1 <- lm(y ~ x + z1, df)
summary(fit1)
##
## Call:
## lm(formula = y ~ x + z1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.088 -7.864 0.157 7.697 38.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40923 0.83158 0.492 0.623
## x -0.04147 0.07402 -0.560 0.575
## z1 -7.81078 0.37948 -20.583 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.56 on 997 degrees of freedom
## Multiple R-squared: 0.9207, Adjusted R-squared: 0.9206
## F-statistic: 5790 on 2 and 997 DF, p-value: < 2.2e-16
Control for z2
fit2 <- lm(y ~ x + z2, df)
summary(fit2)
##
## Call:
## lm(formula = y ~ x + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8357 -3.7067 0.0417 3.5356 15.5855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04009 0.36113 0.111 0.912
## x 0.01185 0.02049 0.579 0.563
## z2 2.01444 0.02533 79.513 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.094 on 997 degrees of freedom
## Multiple R-squared: 0.9846, Adjusted R-squared: 0.9846
## F-statistic: 3.19e+04 on 2 and 997 DF, p-value: < 2.2e-16
Control for both (this works, but is more costly in regression terms)
fit3 <- lm(y ~ x + z1 + z2, df)
summary(fit3)
##
## Call:
## lm(formula = y ~ x + z1 + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2111 -3.6316 0.0424 3.4860 15.0231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12908 0.36529 -0.353 0.72390
## x -0.05705 0.03251 -1.755 0.07956 .
## z1 0.57546 0.21124 2.724 0.00656 **
## z2 2.06802 0.03201 64.604 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.078 on 996 degrees of freedom
## Multiple R-squared: 0.9847, Adjusted R-squared: 0.9847
## F-statistic: 2.141e+04 on 3 and 996 DF, p-value: < 2.2e-16
This is to illustrate Pearl’s point about uncessary confounders. This
graph has a backdoor path with two confounding variables
(z1 and z2). In order to test a causal link
between x and y, we only need to
control for one, as either will block the back door path. This also adds
a path between x and y
fork_dag <- dagify(x ~ z1,
z2 ~ z1,
y ~ x + z2,
coords = list(x = c(x = 1, y = 3, z1 = 1.5, z2 = 2.5),
y = c(x = 1, y = 1, z1 = 2, z2 = 1.5)),
exposure = "x",
outcome = "y")
ggdag(fork_dag) +
theme_dag()
Independence:
impliedConditionalIndependencies(fork_dag)
## x _||_ z2 | z1
## y _||_ z1 | x, z2
Adjustment set: Now we get two adjustment sets, as controlling for
either z1 or z2 removes the path
ggdag_adjustment_set(fork_dag) +
theme_dag()
Equations
z1: \(N(10, 5)\)x ~ z1: \(\beta_1 =
5\)y ~ x + z2: \(\beta_2 = 1.5;
\beta_3 = 2\)z2 ~ z1: \(beta_4 =
-4\)n <- 1000
z1 <- rnorm(n, 10, 5)
x <- z1 * 5 + rnorm(n, 0, 5)
z2 <- z1 * -4 + rnorm(n, 0, 5)
y <- x * 1.5 + z2 * 2 + rnorm(n, 0, 5)
df <- data.frame(x, y, z1, z2)
pairs(df)
Model of y ~ x (biased coefficient, only weakly
significant)
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.905 -8.940 -0.692 9.559 47.136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.52003 0.94845 -2.657 0.00801 **
## x -0.05071 0.01683 -3.013 0.00265 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.77 on 998 degrees of freedom
## Multiple R-squared: 0.009013, Adjusted R-squared: 0.00802
## F-statistic: 9.077 on 1 and 998 DF, p-value: 0.002653
Control for z1
fit1 <- lm(y ~ x + z1, df)
summary(fit1)
##
## Call:
## lm(formula = y ~ x + z1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.411 -7.670 0.018 7.818 36.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.42923 0.79373 0.541 0.589
## x 1.48187 0.07197 20.591 <2e-16 ***
## z1 -7.95320 0.36645 -21.703 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.36 on 997 degrees of freedom
## Multiple R-squared: 0.327, Adjusted R-squared: 0.3256
## F-statistic: 242.2 on 2 and 997 DF, p-value: < 2.2e-16
Control for z2
fit2 <- lm(y ~ x + z2, df)
summary(fit2)
##
## Call:
## lm(formula = y ~ x + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8806 -3.5402 0.1354 3.3948 13.7955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17984 0.34564 0.52 0.603
## x 1.49420 0.01999 74.75 <2e-16 ***
## z2 1.99884 0.02463 81.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.996 on 997 degrees of freedom
## Multiple R-squared: 0.8697, Adjusted R-squared: 0.8695
## F-statistic: 3329 on 2 and 997 DF, p-value: < 2.2e-16
Control for both (this works, but is more costly in regression terms)
fit3 <- lm(y ~ x + z1 + z2, df)
summary(fit3)
##
## Call:
## lm(formula = y ~ x + z1 + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0223 -3.4918 0.1094 3.3704 13.7611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13739 0.34927 0.393 0.694
## x 1.47331 0.03166 46.528 <2e-16 ***
## z1 0.17418 0.20469 0.851 0.395
## z2 2.01523 0.03127 64.451 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.997 on 996 degrees of freedom
## Multiple R-squared: 0.8698, Adjusted R-squared: 0.8694
## F-statistic: 2219 on 3 and 996 DF, p-value: < 2.2e-16
This is the second type of path junction that can cause problems.
Here, our two variables (x and y) are
independent, but are both linked to z, the collider. Ths
issue with this relationship is the bias that results when we
control for the collider
collider_dag <- dagify(z ~ x,
z ~ y,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(collider_dag) +
theme_dag()
Independence: In the case, there is no conditional independence.
x and y are unconditionally independent
impliedConditionalIndependencies(collider_dag)
## x _||_ y
Adjustment set: now we do not need to adjust our model
ggdag_adjustment_set(collider_dag) +
theme_dag()
Separation: This is a test for \(d\)-separation in a DAG. \(d\)-separation indicates that two variables are independent
ggdag_dseparated(collider_dag) +
theme_dag()
We can also test for \(d\)-separation when we control for
z. THis now shows that a backdoor path has been opened by
the control, resulting in a spurious link between x and
y (MAYBE ADD DSEP TO CONFOUNDER
EXAMPLES?)
ggdag_dseparated(collider_dag, controlling_for = "z") +
theme_dag()
Equations
x: \(N(10, 2)\)y: \(N(5, 1)\)z ~ x + y: \(\beta_1 =
5\), \(\beta_2 = 2\)n <- 1000
x <- rnorm(n, 10, 2)
y <- rnorm(n, 5, 1)
z <- 5 * x + 2 * y + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x - this returns the unbiased link (or lack
thereof) between x and y given the
collider
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2297 -0.6825 -0.0230 0.6181 3.5585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.84964 0.16005 30.301 <2e-16 ***
## x 0.01775 0.01554 1.142 0.254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.022 on 998 degrees of freedom
## Multiple R-squared: 0.001305, Adjusted R-squared: 0.0003045
## F-statistic: 1.304 on 1 and 998 DF, p-value: 0.2537
Model controlling for z. This now results in a spurious
relationship between x and y
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.34854 -0.50334 -0.01152 0.46973 2.05116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.29203 0.13751 16.67 <2e-16 ***
## x -1.25069 0.04092 -30.57 <2e-16 ***
## z 0.25375 0.00789 32.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7161 on 997 degrees of freedom
## Multiple R-squared: 0.5098, Adjusted R-squared: 0.5089
## F-statistic: 518.5 on 2 and 997 DF, p-value: < 2.2e-16
collider_dag <- dagify(z ~ x,
z ~ y,
y ~ x,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(collider_dag) +
theme_dag()
Independence: In the case, there is no conditional independence.
x and y are unconditionally independent
impliedConditionalIndependencies(collider_dag)
Adjustment set: now we do not need to adjust our model
ggdag_adjustment_set(collider_dag) +
theme_dag()
Separation: This is a test for \(d\)-separation in a DAG. \(d\)-separation indicates that two variables are independent
ggdag_dseparated(collider_dag) +
theme_dag()
We can also test for \(d\)-separation when we control for
z. THis now shows that a backdoor path has been opened by
the control, resulting in a spurious link between x and
y (MAYBE ADD DSEP TO CONFOUNDER
EXAMPLES?)
ggdag_dseparated(collider_dag, controlling_for = "z") +
theme_dag()
Equations
x: \(N(10, 2)\)y: \(N(5, 1) + \beta_3
x\). \(\beta_3 = 1.5\)z ~ x + y: \(\beta_1 =
5\), \(\beta_2 = 2\)n <- 1000
x <- rnorm(n, 10, 2)
y <- rnorm(n, 5, 1) + 1.5 * x
z <- 5 * x + 2 * y + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x - this returns the unbiased coefficient
between x and y given the collider
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7750 -0.6836 0.0241 0.6996 3.1363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.76558 0.16324 29.19 <2e-16 ***
## x 1.52583 0.01596 95.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 998 degrees of freedom
## Multiple R-squared: 0.9016, Adjusted R-squared: 0.9015
## F-statistic: 9141 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. This now results in a biased
coefficient between x and y
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.31029 -0.50096 -0.01734 0.47372 2.18996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.350176 0.142529 16.489 < 2e-16 ***
## x -0.522138 0.068753 -7.594 7.1e-14 ***
## z 0.254487 0.008422 30.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7445 on 997 degrees of freedom
## Multiple R-squared: 0.9486, Adjusted R-squared: 0.9485
## F-statistic: 9204 on 2 and 997 DF, p-value: < 2.2e-16